Libraries
We load the packages relevant for the exercise.
library(FactoMineR)
library(tidyr)
library(dplyr)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(ggpubr)
library(factoextra)
library(gridExtra)
library(moments)
Screw Caps Data
The data ScrewCaps.csv contains 195 lots of screw caps described by 11 variables. Diameter, weight and Length are the physical characteristics of the cap; nb.of.pieces corresponds to the number of elements of the cap (the picture above corresponds to a cap with 2 pieces: the valve (clapet) is made of a different material); Mature.Volume corresponds to the number of caps ordered and bought by the company (number in the lot).
raw_data <- read.table("ScrewCaps.csv",header=TRUE, sep=",", dec=".", row.names=1)
head(raw_data)
dim(raw_data)
[1] 195 11
summary(raw_data)
Supplier        Diameter        weight         nb.of.pieces    Shape        Impermeability  Finishing         Mature.Volume   Raw.Material  Price           Length
Supplier A: 31  Min.   :0.4458  Min.   :0.610  Min.   : 2.000  Shape 1:134  Type 1:172      Hot Printing: 62  Min.   :  1000  ABS: 21       Min.   : 6.477  Min.   : 3.369
Supplier B:150  1st Qu.:0.7785  1st Qu.:1.083  1st Qu.: 3.000  Shape 2: 45  Type 2: 23      Lacquering  :133  1st Qu.: 15000  PP :148       1st Qu.:11.807  1st Qu.: 6.161
Supplier C: 14  Median :1.0120  Median :1.400  Median : 4.000  Shape 3:  8                                    Median : 45000  PS : 26       Median :14.384  Median : 8.086
                Mean   :1.2843  Mean   :1.701  Mean   : 4.113  Shape 4:  8                                    Mean   : 96930                Mean   :16.444  Mean   :10.247
                3rd Qu.:1.2886  3rd Qu.:1.704  3rd Qu.: 5.000                                               3rd Qu.:115000                3rd Qu.:18.902  3rd Qu.:10.340
                Max.   :5.3950  Max.   :7.112  Max.   :10.000                                               Max.   :800000                Max.   :46.610  Max.   :43.359
2) We start with univariate and bivariate descriptive statistics. Using appropriate plot(s) or summaries answer the following questions.
a) How is the distribution of the Price? Comment your plot with respect to the quartiles of the Price.
From the quantile output, the first quartile, median and third quartile of the Price are 11.807, 14.384 and 18.902 respectively.
The plots, together with the skewness (1.71) and the kurtosis (6.40, against 3 for a normal distribution), suggest that the price follows a bimodal distribution that is skewed to the right. The major mode is around 14 and the antimode is around 29. Furthermore, 50% of the prices lie between 11.807 and 18.902. This is consistent with the density plot, where most of the mass is concentrated in this range, with a long right tail of higher prices.
The boxplot supports this analysis and flags the values in the tail as outliers.
price_density <- ggdensity(raw_data,x="Price",y = "..count..",
color="darkblue",
fill="lightblue",size=0.5,
alpha=0.2,
title = "Screw Cap Price Distribution",
linetype = "solid", add = c("median"))+ font("title", size = 12,face="bold")
price_boxplot <- ggboxplot(raw_data$Price, width = 0.1, fill ="lightgray", outlier.colour = "darkblue", outlier.shape=4, ylab = "Price", xlab = "Screw Caps" , title = "Price Box Plot") + rotate() + font("title", size = 12,face="bold")
price_quantile <- quantile(raw_data$Price)
ggarrange(price_density, price_boxplot, ncol = 1, nrow = 2)
price_quantile
0% 25% 50% 75% 100%
6.477451 11.807022 14.384413 18.902429 46.610372
skewness(raw_data$Price)
[1] 1.706151
kurtosis(raw_data$Price)
[1] 6.395453
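As a side note on how to read these two figures, the sketch below recomputes skewness and kurtosis from central moments. It uses a simulated right-skewed sample (an assumption for illustration, not the screw cap data). To our understanding this matches the convention of the moments package, in which a normal distribution has kurtosis 3 rather than 0.

```r
# Simulated right-skewed sample; NOT the screw cap prices.
set.seed(1)
x <- rlnorm(1000, meanlog = 2.7, sdlog = 0.4)

# Central moments of order 2, 3 and 4 (dividing by n).
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skew <- m3 / m2^1.5  # 0 for a symmetric distribution, > 0 for a long right tail
kurt <- m4 / m2^2    # 3 for a normal distribution (this is not excess kurtosis)

c(skewness = skew, kurtosis = kurt)
```

A positive skewness together with a kurtosis above 3 is what one would expect from a distribution with a heavy right tail, as observed for the Price.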
b) Does the Price depend on the Length? weight?
We examine Price vs. Length, log(Price) vs. log(Length); Price vs. weight, log(Price) vs. log(weight) and provide the summary for each.
The plots suggest a positive relationship between Price and both variables. The R-squared values (between 0.54 and 0.64) and the F and t tests (p-values below 2.2e-16) confirm that this relationship is highly significant.
price_length <- ggplot(raw_data, aes(x=Length, y=Price)) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_length_log <- ggplot(raw_data, aes(x=log(Length), y=log(Price))) + geom_point() + geom_smooth(method=lm, color="darkgreen")+ theme_minimal()
price_weight <- ggplot(raw_data, aes(x=weight, y=Price)) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
price_weight_log <- ggplot(raw_data, aes(x=log(weight), y=log(Price))) + geom_point() + geom_smooth(method=lm,color="red")+theme_minimal()
ggarrange(ggarrange(price_length, price_length_log, ncol = 2, nrow = 1), ggarrange(price_weight, price_weight_log, ncol = 2, nrow = 1), ncol = 1, nrow = 2)
summary(lm(formula = Price ~ Length, raw_data))
Call:
lm(formula = Price ~ Length, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-13.901 -2.854 -0.741 1.931 16.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.94613 0.50918 17.57 <2e-16 ***
Length 0.73168 0.03953 18.51 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.308 on 193 degrees of freedom
Multiple R-squared: 0.6397, Adjusted R-squared: 0.6378
F-statistic: 342.6 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(Length), raw_data))
Call:
lm(formula = log(Price) ~ log(Length), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.70368 -0.15501 -0.01661 0.15170 0.59211
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.56380 0.07278 21.49 <2e-16 ***
log(Length) 0.53875 0.03282 16.42 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2466 on 193 degrees of freedom
Multiple R-squared: 0.5827, Adjusted R-squared: 0.5805
F-statistic: 269.5 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = Price ~ weight, raw_data))
Call:
lm(formula = Price ~ weight, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-14.7993 -2.6207 -0.6631 2.5396 13.8357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2275 0.5602 14.69 <2e-16 ***
weight 4.8312 0.2718 17.78 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.419 on 193 degrees of freedom
Multiple R-squared: 0.6208, Adjusted R-squared: 0.6189
F-statistic: 316 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(formula = log(Price) ~ log(weight), raw_data))
Call:
lm(formula = log(Price) ~ log(weight), data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-0.71123 -0.15340 -0.01343 0.17735 0.69552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.50618 0.02333 107.42 <2e-16 ***
log(weight) 0.56453 0.03718 15.18 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2577 on 193 degrees of freedom
Multiple R-squared: 0.5443, Adjusted R-squared: 0.5419
F-statistic: 230.5 on 1 and 193 DF, p-value: < 2.2e-16
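In the log-log models the slope can be read as an elasticity: with a slope of about 0.54, doubling the Length multiplies the fitted Price by roughly 2^0.54, i.e. about 1.45. The minimal sketch below shows that such a regression recovers the exponent of a power law. It runs on simulated data; the names len and price and the true exponent 0.5 are assumptions made for illustration.

```r
# Simulated power-law data; NOT the screw cap data.
set.seed(42)
n <- 200
len <- runif(n, min = 3, max = 45)              # hypothetical lengths
price <- 5 * len^0.5 * exp(rnorm(n, sd = 0.2))  # true elasticity is 0.5

fit <- lm(log(price) ~ log(len))
slope <- unname(coef(fit)["log(len)"])

slope    # estimated elasticity, close to 0.5
2^slope  # multiplicative effect on price of doubling the length
```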
c) Does the Price depend on the Impermeability? Shape?
Concerning Impermeability, the plots below show striking differences between the price distributions for Type 1 and Type 2, in particular in the medians and the IQR. The regression confirms this: Type 2 caps cost on average 14.58 more than Type 1 caps (p < 2e-16).
Concerning Shape, it is difficult to draw any firm conclusion for Shape 3 and Shape 4 given that there are so few data points (8 each). We turn our attention to Shape 1 and Shape 2: the price distributions for these two shapes appear different. This is confirmed by the t test on the Shape 2 coefficient (8.14 above Shape 1, p = 1.75e-12).
impermability_plot_1 <- ggdotplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",binwidth = 1,legend="none")
shape_plot_1 <- ggdotplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg",binwidth = 1,legend="none")
impermability_plot_2 <- ggboxplot(raw_data,x="Impermeability",y="Price",color = "Impermeability", palette = "jco",legend="none")
shape_plot_2 <- ggboxplot(raw_data,x="Shape",y="Price",color = "Shape", palette = "npg", legend = "none")
ggarrange(ggarrange(impermability_plot_1,impermability_plot_2,ncol = 2, nrow = 1),
ggarrange(shape_plot_1,shape_plot_2,ncol = 2, nrow = 1),
ncol = 1, nrow = 2)
summary(lm(Price~ Impermeability, data=raw_data))
Call:
lm(formula = Price ~ Impermeability, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-16.4106 -3.0187 -0.6286 2.4897 25.0638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.7236 0.4117 35.77 <2e-16 ***
ImpermeabilityType 2 14.5846 1.1986 12.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.399 on 193 degrees of freedom
Multiple R-squared: 0.4341, Adjusted R-squared: 0.4312
F-statistic: 148 on 1 and 193 DF, p-value: < 2.2e-16
summary(lm(Price~ Shape, data=raw_data))
Call:
lm(formula = Price ~ Shape, data = raw_data)
Residuals:
Min 1Q Median 3Q Max
-11.098 -3.850 -1.025 3.055 25.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.2006 0.5406 26.267 < 2e-16 ***
ShapeShape 2 8.1403 1.0782 7.550 1.75e-12 ***
ShapeShape 3 1.4510 2.2777 0.637 0.52485
ShapeShape 4 7.4393 2.2777 3.266 0.00129 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.258 on 191 degrees of freedom
Multiple R-squared: 0.2475, Adjusted R-squared: 0.2357
F-statistic: 20.94 on 3 and 191 DF, p-value: 9.008e-12
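With a categorical predictor, lm uses dummy coding: the intercept is the mean of the reference category and each other coefficient is the difference between a category mean and that reference. The sketch below checks this on simulated data (the group sizes and price levels are hypothetical). It also runs a Welch t test, which, unlike the regression, does not assume equal variances in the two groups.

```r
# Simulated two-group data; NOT the screw cap data.
set.seed(7)
type  <- factor(rep(c("Type 1", "Type 2"), times = c(170, 23)))
price <- ifelse(type == "Type 2", 29, 15) + rnorm(length(type), sd = 5)

fit <- lm(price ~ type)
group_means <- tapply(price, type, mean)

coef(fit)     # intercept = mean of Type 1, slope = mean of Type 2 - mean of Type 1
group_means

welch <- t.test(price ~ type)  # Welch two-sample t test
welch$p.value
```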
d) Which is the less expensive Supplier?
The answer to this question depends on the definition of expensive.
First, examine the following absolute metrics (these can be read off the boxplot):
1) Absolute price - Supplier B is the cheapest (6.477451). However, Supplier B also has the highest absolute price (46.610372).
2) Average price - Supplier C is the cheapest (14.88869).
Second, examine the following relative metrics:
3) Average Price / Unit Length - Supplier A (1.505043)
4) Average Price / Unit weight - Supplier A (9.013902)
5) Average Price / Unit Diameter - Supplier A (11.95632)
The results above suggest that Supplier A has the lowest average price per unit of length, weight and diameter.
The analysis is, however, not complete, since we do not have a definition of cheapest price. The scatter plots and boxplots below suggest that suppliers may cater to specific product ranges. The analysis also ignores the categorical data, which could provide insight into the cheapest price for certain product features. Furthermore, we have not performed statistical tests to examine the significance of these differences.
supplier_plot_1 <- ggboxplot(raw_data,x="Supplier",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),legend="none") + rotate()
supplier_plot_2 <- ggscatter(raw_data,x="Length",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_3 <- ggscatter(raw_data,x="weight",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_plot_4 <- ggscatter(raw_data,x="Diameter",y="Price",color = "Supplier", palette = c("darkblue","red","darkgreen"),xscale= "log10", yscale="log10")
supplier_statistics <- raw_data %>% group_by(Supplier) %>% summarise( "Average Price" = mean(Price), "Average Length" = mean(Length),"Average weight" = mean(weight),"Average Diameter" = mean(Diameter), "Average Price / Length" = mean(Price)/mean(Length), "Average Price / weight" = mean(Price)/mean(weight), "Average Price / Diameter" = mean(Price)/mean(Diameter))
supplier_plot_1
supplier_plot_2
supplier_plot_3
supplier_plot_4
head(supplier_statistics)
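The relative metrics above are ratios of means (mean(Price)/mean(Length)), which is not the same as the mean of the per-lot ratios. The toy example below, with made-up numbers and hypothetical suppliers A and B, shows that the two definitions can even rank suppliers differently. The definition of "price per unit" is therefore worth stating explicitly.

```r
# Made-up lots for two hypothetical suppliers; NOT the screw cap data.
lots <- data.frame(
  Supplier = c("A", "A", "B", "B"),
  Price    = c(10, 40, 12, 26),
  Length   = c(5, 40, 8, 20)
)

by_supplier <- split(lots, lots$Supplier)

# Ratio of means: the definition used in supplier_statistics.
ratio_of_means <- sapply(by_supplier, function(d) mean(d$Price) / mean(d$Length))

# Mean of ratios: average of the price per unit length of each lot.
mean_of_ratios <- sapply(by_supplier, function(d) mean(d$Price / d$Length))

rbind(ratio_of_means, mean_of_ratios)
```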
3) One important point in exploratory data analysis consists in identifying potential outliers. Could you give points which are suspect regarding the Mature.Volume variable? Give the characteristics (other features) of the observations that seem suspect.
There are four data points which seem suspect: they share the same values for Diameter, weight, nb.of.pieces, Impermeability, Finishing, Raw.Material and Mature.Volume, and differ only in their Supplier, Price and Length. This suggests some error in collating the data (system error / default data).
Mature.Volume_plot <- gghistogram(raw_data,x="Mature.Volume",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
Mature.Volume_plot
raw_data %>% filter (Mature.Volume > 6e+05 )
For the rest of the analysis, the 4 data points above are disregarded.
library(dplyr)
raw_data <- raw_data %>% filter (Mature.Volume < 6e+05 )
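The threshold of 6e+05 above was chosen by eye from the histogram. A common, more systematic alternative is Tukey's rule, which flags values above the third quartile plus 1.5 times the IQR. The sketch below applies it to made-up lot sizes (an assumption for illustration, not the ScrewCaps values); it is not the rule used in this analysis.

```r
# Made-up lot sizes; NOT the Mature.Volume column.
volume <- c(1000, 15000, 45000, 60000, 115000, 120000, 800000)

q <- quantile(volume, probs = c(0.25, 0.75))

# Upper fence of Tukey's rule: Q3 + 1.5 * IQR.
upper_fence <- unname(q[2] + 1.5 * diff(q))

suspect <- volume > upper_fence
upper_fence
volume[suspect]
```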
We will quickly check that there are no other noticeable outliers - this is indeed the case.
check_1 <- gghistogram(raw_data,x="Length",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
check_2 <- gghistogram(raw_data,x="Diameter",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
check_3 <- gghistogram(raw_data,x="weight",y="..count..", color = "darkblue", fill = "lightgrey") + theme_minimal()
Using `bins = 30` by default. Pick better value with the argument `bins`.
check_4 <- gghistogram(raw_data,x="nb.of.pieces",y="..count..", color = "darkblue", fill = "lightgrey",bins=40) + theme_minimal()
ggarrange(ggarrange(check_1,check_2,ncol=2,nrow=1),ggarrange(check_3,check_4,ncol=2,nrow=1),ncol=1,nrow=2)
4) Perform a PCA on the dataset ScrewCap, explain briefly what are the aims of a PCA and how categorical variables are handled?
Principal components analysis (PCA) is a technique for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form without losing too much information: we try to capture the essence of high-dimensional data in a low-dimensional representation. The aim of PCA is to draw conclusions from the linear relationships between variables by detecting the principal dimensions of variability. This may be for compression, denoising, data completion, anomaly detection or preprocessing before supervised learning (to improve performance, or to regularize and reduce overfitting).
The categorical variables cannot be represented in the same way as the supplementary quantitative variables, since it is not possible to calculate the correlation between a categorical variable and the principal components. The categorical variables are handled here as supplementary variables on a purely illustrative basis: they are not used to calculate the distances between individuals. We represent each category at the barycentre of all the individuals possessing that category. A category on the PCA performed below can therefore be regarded as the mean individual obtained from the set of individuals who have it.
Given our ultimate goal here is to explore data prior to a multiple regression, it is advisable to choose the explanatory variables for the regression model as active variables for PCA, and to project the variable to be explained (the dependent variable) as a supplementary variable. This gives some idea of the relationships between explanatory variables and thus of the need to select explanatory variables. This also gives us an idea of the quality of the regression: if the dependent variable is appropriately projected, it will be a well-fitted model. Thus we select Price as a supplementary variable.
The dataset in this exercise contains 6 supplementary variables:
- 1 quantitative variable (Price)
- 5 qualitative variables (Supplier, Shape, Impermeability, Finishing and Raw.Material).
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10, graph = FALSE)
fviz_pca_ind(res.pca, col.ind="cos2", label=c("quali"), geom = "point", title = "Individual factor map (PCA)", habillage = "none") + scale_color_gradient2(low="lightblue", mid="blue", high="darkblue", midpoint=0.6) + theme_minimal()
plot.PCA(res.pca,choix = c("ind"),invisible = c("ind"))+theme_minimal()
NULL
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
NULL
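The treatment of supplementary elements described above can be sketched in base R. The example below uses prcomp instead of FactoMineR and simulated data, so the variable names, categories and effect sizes are all assumptions. The components are built from the active variables only; each category is then placed at the mean of the scores of its individuals, and the supplementary quantitative variable is summarised by its correlation with each component.

```r
# Simulated data; NOT the screw cap data.
set.seed(3)
n <- 120
group <- factor(sample(c("PP", "PS", "ABS"), n, replace = TRUE))  # hypothetical categories
size  <- rnorm(n) + ifelse(group == "PS", 2, 0)
X <- cbind(
  Diameter = size + rnorm(n, sd = 0.1),
  Length   = size + rnorm(n, sd = 0.1),
  Pieces   = rnorm(n)
)

# Standardized PCA on the active variables only.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Supplementary category = barycentre (mean) of the scores of its individuals.
barycentres <- apply(scores, 2, function(s) tapply(s, group, mean))
barycentres

# Supplementary quantitative variable = its correlation with each component.
price <- 10 + 3 * size + rnorm(n)  # hypothetical dependent variable
cor(price, scores)
```

Because neither group nor price enters prcomp, they have no influence on the axes; they are only positioned afterwards. The sign of each component is arbitrary.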
5) Compute the correlation matrix between the variables and comment it with respect to the correlation circle
The first task is to center and standardize the variables; the correlation matrix is then computed (cor() would give the same result on the unscaled data, since correlation is invariant to centering and scaling). Diameter, weight and Length lie almost on the boundary of the correlation circle on the variables plot, so they are well projected on the 2-dimensional subspace; nb.of.pieces and Mature.Volume lie a little further inside, as their coordinates on the third dimension are not negligible. We now turn our attention to correlations between variables.
The correlations can be visualised through the angles between variables on the correlation circle and related to the correlation matrix (small angles suggest a large positive correlation, 90 degree angles suggest no correlation, 180 degree angles suggest a large negative correlation):
- Diameter, Length and weight exhibit very strong correlation: the angle between them is close to 0, suggesting a correlation close to 1 (0.96 to 1.00 in the matrix). They are all very highly correlated with the first principal component.
- These three variables are at an angle slightly wider than a right angle to both nb.of.pieces and Mature.Volume in the circle, which suggests a slightly negative correlation (-0.15 to -0.31 in the matrix).
- Price is highly correlated with the three variables above.
- Equally, Mature.Volume and nb.of.pieces are at an angle wider than a right angle, which suggests a negative correlation. The matrix shows that it is in fact weak (-0.07): angles are only reliable for well-projected variables. There is therefore only a slight tendency for the company to order smaller volumes of caps with a high number of pieces. These two variables are mainly projected on the second principal component.
don <- as.matrix(raw_data[,-c(1,5,6,7,9,10)]) %>% scale()
don_correlation <- cor(don)
don_correlation
Diameter weight nb.of.pieces Mature.Volume Length
Diameter 1.0000000 0.9622544 -0.14869500 -0.29164724 0.9996963
weight 0.9622544 1.0000000 -0.16884367 -0.31321323 0.9627460
nb.of.pieces -0.1486950 -0.1688437 1.00000000 -0.07462463 -0.1463770
Mature.Volume -0.2916472 -0.3132132 -0.07462463 1.00000000 -0.2936330
Length 0.9996963 0.9627460 -0.14637705 -0.29363295 1.0000000
plot.PCA(res.pca,choix = c("var"))+theme_minimal()
NULL
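The link between the correlation circle and the correlation matrix can be checked numerically. In a standardized PCA, the coordinate of a variable on a component is its correlation with that component. The cosine of the angle between two well-projected variables then approximates their correlation. The sketch below uses prcomp on simulated data (variable names and noise levels are assumptions), not the FactoMineR object from this analysis.

```r
# Simulated data; NOT the screw cap data.
set.seed(11)
n <- 150
size <- rnorm(n)
X <- cbind(
  Diameter = size + rnorm(n, sd = 0.2),
  Length   = size + rnorm(n, sd = 0.2),
  Pieces   = rnorm(n)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Coordinates of the variables on the correlation circle:
# loadings multiplied by the standard deviation of each component.
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# They coincide with the correlations between variables and component scores.
var_cor <- cor(X, pca$x)
all.equal(var_coord, var_cor, check.attributes = FALSE)

# Cosine of the angle between two variables in the first plane,
# compared with their actual correlation.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(var_coord["Diameter", 1:2], var_coord["Length", 1:2])
cor(X)["Diameter", "Length"]
```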
6) On what kind of relationship PCA focuses? Is it a problem?
PCA focuses on the linear relationships between continuous variables. More complex links also exist, such as quadratic, logarithmic or exponential relationships, so this may seem restrictive, but in practice many relationships can be considered linear, at least as an initial approximation. However, there are clearly non-linear datasets for which PCA has pitfalls (e.g. a spiral dataset). Furthermore, in PCA categorical variables cannot be active variables, which can be restrictive.
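A two-line illustration of the limitation, on constructed data: a variable that is an exact quadratic function of another can have a Pearson correlation of zero with it. A PCA, which only sees the correlation matrix, would therefore treat the two as unrelated.

```r
x <- seq(-1, 1, length.out = 101)
y <- x^2         # y is fully determined by x, but not linearly

cor(x, y)        # numerically zero: no linear relationship
cor(abs(x), y)   # strong once the relationship is made monotone
```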
7) Comment the PCA outputs
Comment the position of the categories Impermeability=type 2 and Raw.Material=PS.
The coordinates of Type 2 on the first two principal components are (3.28923043, 0.013320364). The coordinates of PS are (2.67437709, -0.253323291).
Both categories have a high coordinate on the first principal component. Given that the correlation circle shows a high correlation between the first component and Price, Diameter, Length and weight, this suggests that lots with Type 2 impermeability or PS raw material have high values for these variables.
res.pca$quali.sup$coord
Dim.1 Dim.2 Dim.3
Supplier A 0.54805992 -0.054566515 -0.214051234
Supplier B -0.06543165 -0.125589918 -0.026949041
Supplier C -0.44356100 1.440695488 0.728281700
Shape 1 -0.42564773 -0.137916238 -0.214123559
Shape 2 1.42726960 0.394279456 0.383010989
Shape 3 -0.55969671 -0.332207048 0.059604995
Shape 4 -0.55191919 0.355523978 1.265466019
Type 1 -0.45031131 -0.001823621 -0.009194259
Type 2 3.28923043 0.013320364 0.067158065
Hot Printing -0.28600503 -0.037712714 0.192161713
Lacquering 0.13745978 0.018125491 -0.092356792
ABS 0.87599666 0.220028373 -0.581512708
PP -0.61062316 0.013651457 0.120658551
PS 2.67437709 -0.253323291 -0.198579404
dimdesc(res.pca)
$Dim.1
$Dim.1$quanti
correlation p.value
Length 0.9853764 3.259183e-147
Diameter 0.9851090 1.784008e-146
weight 0.9774643 1.263294e-129
Price 0.7960132 4.472456e-43
nb.of.pieces -0.2017085 5.139018e-03
Mature.Volume -0.4118157 3.243173e-09
$Dim.1$quali
R2 p.value
Impermeability 0.4767041 2.203784e-28
Raw.Material 0.4309747 9.602186e-24
Shape 0.2024825 3.268025e-09
$Dim.1$category
Estimate p.value
Type 2 1.8697709 2.203784e-28
PS 1.6944602 3.078822e-20
Shape 2 1.4547681 6.874053e-11
ABS -0.1039202 1.566216e-02
Shape 1 -0.3981492 5.692581e-07
PP -1.5905400 1.465743e-20
Type 1 -1.8697709 2.203784e-28
$Dim.2
$Dim.2$quanti
correlation p.value
nb.of.pieces 0.8426737 1.045066e-52
Price 0.1706662 1.824955e-02
Mature.Volume -0.5956751 9.962493e-20
$Dim.2$quali
R2 p.value
Supplier 0.15447684 1.411880e-07
Shape 0.05575798 1.311396e-02
$Dim.2$category
Estimate p.value
Supplier C 1.0205158 1.996715e-08
Shape 2 0.3243594 3.249853e-03
Shape 1 -0.2078363 6.889304e-03
Supplier B -0.5457696 1.703585e-03
$Dim.3
$Dim.3$quanti
correlation p.value
Mature.Volume 0.6895973 2.719826e-28
nb.of.pieces 0.4991858 1.977210e-13
Diameter 0.1441285 4.667716e-02
Length 0.1439205 4.699812e-02
$Dim.3$quali
R2 p.value
Shape 0.17118941 1.103146e-07
Raw.Material 0.06889656 1.218443e-03
Supplier 0.05972244 3.062463e-03
Finishing 0.02284485 3.687355e-02
$Dim.3$category
Estimate p.value
Shape 4 0.891976408 2.451738e-05
Shape 2 0.009521378 7.736497e-04
PP 0.340469738 8.420171e-04
Supplier C 0.565854559 1.216826e-03
Hot Printing 0.142259253 3.687355e-02
Lacquering -0.142259253 3.687355e-02
ABS -0.361701521 1.247578e-03
Shape 1 -0.587613170 4.809896e-07
Comment the percentage of inertia
The scree plot below shows the percentage of inertia explained by each component. The first three components explain about 99% of the variance (62.1%, 21.3% and 15.5% respectively). Furthermore, the first component is built mostly from Diameter, Length and weight (about 31% contribution each), as expected, while the second and third dimensions are built from nb.of.pieces and Mature.Volume.
res.pca$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 3.1071215080 62.142430160 62.14243
comp 2 1.0669070766 21.338141532 83.48057
comp 3 0.7768681861 15.537363723 99.01794
comp 4 0.0488056018 0.976112036 99.99405
comp 5 0.0002976274 0.005952549 100.00000
fviz_eig(res.pca, addlabels = TRUE)
res.pca$var$contrib
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Diameter 31.232761 0.06080408 2.673943 16.37238500 4.966011e+01
weight 30.749891 0.07707453 1.371044 67.79961131 2.379254e-03
nb.of.pieces 1.309454 66.55678435 32.075772 0.05771098 2.791480e-04
Mature.Volume 5.458176 33.25770535 61.213010 0.07097644 1.329131e-04
Length 31.249719 0.04763169 2.666232 15.69931627 5.033710e+01
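The contributions can be recovered from the coordinates of the variables: the contribution of a variable to a component is its squared coordinate divided by the eigenvalue of that component, times 100. The check below uses figures copied from the res.pca$var$coord and res.pca$eig outputs of this document for the first two dimensions; the formula is our understanding of how FactoMineR defines contributions.

```r
# Coordinates of the variables (copied from res.pca$var$coord[, 1:2]).
coord <- rbind(
  Diameter      = c(Dim.1 = 0.9851090, Dim.2 = -0.02547004),
  weight        = c(0.9774643, -0.02867601),
  nb.of.pieces  = c(-0.2017085, 0.84267375),
  Mature.Volume = c(-0.4118157, -0.59567509),
  Length        = c(0.9853764, -0.02254298)
)

# Eigenvalues of the first two components (copied from res.pca$eig).
eig <- c(Dim.1 = 3.1071215080, Dim.2 = 1.0669070766)

# Contribution (%) = 100 * squared coordinate / eigenvalue.
contrib <- 100 * sweep(coord^2, 2, eig, "/")
round(contrib, 2)
colSums(contrib)  # each column sums to about 100
```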
8) Give the R object with the two principal components which are the synthetic variables the most correlated to all the variables.
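The coordinates of the variables on the first two components are given by the code below; because the PCA is standardized, they are also the correlations between each variable and the components. The components themselves (the scores of the individuals) are stored in res.pca$ind$coord[,1:2].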
res.pca$var$coord[,1:2]
Dim.1 Dim.2
Diameter 0.9851090 -0.02547004
weight 0.9774643 -0.02867601
nb.of.pieces -0.2017085 0.84267375
Mature.Volume -0.4118157 -0.59567509
Length 0.9853764 -0.02254298
9) PCA is often used as a pre-processing step before applying a clustering algorithm, explain the rationale of this approach and how many components k you keep.
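Clustering on the first principal components rather than on the raw variables discards the last dimensions, which can be considered as noise, and so makes the clustering more stable. We choose the number of components so as not to lose too much information while discarding those last components. Consequently, we keep enough dimensions to retain 95% of the inertia, which corresponds to 3 components in our analysis (99.0% of the inertia, against 83.5% with 2 components).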
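The 95% rule can be written down in a few lines. The eigenvalues below are copied from the res.pca$eig output above; the threshold of 95 is the one chosen in this answer.

```r
# Eigenvalues copied from the res.pca$eig output.
eig <- c(3.1071215080, 1.0669070766, 0.7768681861, 0.0488056018, 0.0002976274)

pct <- 100 * eig / sum(eig)  # percentage of inertia per component
cum_pct <- cumsum(pct)       # cumulative percentage
round(cum_pct, 2)

# Smallest number of components reaching the 95% threshold.
k <- which(cum_pct >= 95)[1]
k
```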
10) Perform a kmeans algorithm on the selected k principal components of PCA. How many clusters are you keeping? Justify.
We use the elbow method and look at the knee to determine the number of clusters we keep - here
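A minimal sketch of the elbow method, assuming the scores of the individuals on the retained components are stored in a numeric matrix. Here that matrix is simulated with three well-separated groups, so the number of clusters it suggests says nothing about the screw cap data. For each candidate number of clusters we record the total within-cluster sum of squares and look for the point after which adding a cluster no longer reduces it much.

```r
# Simulated scores with three well-separated groups; NOT the screw cap components.
set.seed(123)
centres <- rbind(c(0, 0), c(6, 6), c(-6, 6))
scores <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centres[i, 1]), rnorm(50, centres[i, 2]))
}))

# Total within-cluster sum of squares for 1 to 6 clusters.
# nstart = 25 restarts kmeans from several random initialisations.
wss <- sapply(1:6, function(k) kmeans(scores, centers = k, nstart = 25)$tot.withinss)
round(wss)

# Elbow plot: the knee is where the curve flattens.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
```

On the real data, scores would be replaced by the first three columns of res.pca$ind$coord.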
11) Perform an AHC on the selected k principal components of PCA.
Below we perform an AHC on the selected 3 principal components of PCA.
Comment the results and describe precisely one cluster - Add Fisher Test
Cluster 1 is made of individuals sharing:
- high values for the variable Mature.Volume.
- low values for the variables nb.of.pieces, Price, weight, Length and Diameter (variables are sorted from the weakest).
Cluster 2 is made of individuals sharing:
- high values for the variable nb.of.pieces.
- low values for the variables Mature.Volume, Diameter, Length, weight and Price (variables are sorted from the weakest).
Cluster 3 is made of individuals such as 89, 90, 131, 161, 163 and 164. This group is characterized by:
- high values for the variables Length, Diameter, weight and Price (variables are sorted from the strongest).
- low values for the variables nb.of.pieces and Mature.Volume (variables are sorted from the weakest).
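Statements such as "high values for the variable ..." rest on the v.test reported by HCPC and catdes. It compares the mean of a variable within a group to the overall mean, scaled by the standard error expected if the group had been drawn at random without replacement. The sketch below recomputes one v.test from figures printed later in this document by catdes (Diameter for Impermeability = Type 2). The formula is our understanding of the FactoMineR definition.

```r
# Figures copied from the catdes output (Diameter, Impermeability = Type 2).
mean_in_cat  <- 3.115573
overall_mean <- 1.294639
overall_sd   <- 0.9821218
N   <- 191  # lots remaining after removing the 4 suspect ones
n_q <- 23   # Type 2 lots (12.04% of 191)

# v.test: standardized gap between the group mean and the overall mean.
v_test <- (mean_in_cat - overall_mean) /
  sqrt(overall_sd^2 / n_q * (N - n_q) / (N - 1))
v_test

# Two-sided p-value under a standard normal reference.
p_value <- 2 * pnorm(-abs(v_test))
p_value
```

An absolute v.test above about 1.96 corresponds to a p-value below 0.05; the sign indicates whether the group mean is above or below the overall mean.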
If someone asks you why you have selected k components and not k + 1 or k − 1, what is your answer? (Could you suggest a strategy to assess the stability of the approach? Are there many differences between the clustering obtained on k components and on the initial data?)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=4)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=2)
res.hcpc <- HCPC(res.pca, nb.clust = -1)
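One possible strategy to assess stability is to cluster on k - 1, k and k + 1 components and measure how much the partitions agree, for example with a cross-tabulation or the Rand index (the share of pairs of individuals that two partitions treat in the same way). The sketch below uses base R (prcomp, hclust with Ward's criterion, cutree) on simulated data, so the data, the helper names partition and rand_index, and the choice of 3 clusters are all assumptions; it is not the HCPC procedure itself.

```r
# Simulated data with three groups living in a 2-dimensional latent space.
set.seed(5)
centres <- rbind(c(0, 0), c(4, 4), c(-4, 4))
latent <- centres[rep(1:3, each = 40), ] + matrix(rnorm(120 * 2), ncol = 2)
X <- cbind(latent[, 1], latent[, 1], latent[, 2], latent[, 2]) +
  matrix(rnorm(120 * 4, sd = 0.5), ncol = 4)

scores <- prcomp(X, center = TRUE, scale. = TRUE)$x

# Ward clustering on the first k components, cut into 3 clusters.
partition <- function(k) {
  tree <- hclust(dist(scores[, 1:k, drop = FALSE]), method = "ward.D2")
  cutree(tree, k = 3)
}

# Rand index: share of pairs of individuals on which two partitions agree.
rand_index <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  mean((same_a == same_b)[upper.tri(same_a)])
}

rand_index(partition(2), partition(3))
rand_index(partition(3), partition(4))  # 4 components = the full standardized data
table(partition(2), partition(3))
```

A Rand index close to 1 across neighbouring values of k would indicate that the clustering does not depend much on the exact number of components kept.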
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
res.hcpc
**Results for the Hierarchical Clustering on Principal Components**
name description
1 "$data.clust" "dataset with the cluster of the individuals"
2 "$desc.var" "description of the clusters by the variables"
3 "$desc.var$quanti.var" "description of the cluster var. by the continuous var."
4 "$desc.var$quanti" "description of the clusters by the continuous var."
5 "$desc.var$test.chi2" "description of the cluster var. by the categorical var."
6 "$desc.axes$category" "description of the clusters by the categories."
7 "$desc.axes" "description of the clusters by the dimensions"
8 "$desc.axes$quanti.var" "description of the cluster var. by the axes"
9 "$desc.axes$quanti" "description of the clusters by the axes"
10 "$desc.ind" "description of the clusters by the individuals"
11 "$desc.ind$para" "parangons of each clusters"
12 "$desc.ind$dist" "specific individuals"
13 "$call" "summary statistics"
14 "$call$t" "description of the tree"
plot.HCPC(res.hcpc, choice = 'map', draw.tree = FALSE, title = '', select=c("12"))
res.pca <- PCA(raw_data,quali.sup = c(1,5,6,7,9),quanti.sup = 10,ncp=3)
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE)
Chi-squared approximation may be incorrect
Characterization of each supplier
catdes(raw_data, num.var=1)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 9.049049e-05 4
Impermeability 1.088731e-02 2
$category
$category$`Supplier A`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PS 42.30769 37.93103 13.61257 0.0002998155 3.615459
Impermeability=Type 2 34.78261 27.58621 12.04188 0.0130149176 2.483361
Shape=Shape 2 26.66667 41.37931 23.56021 0.0213728107 2.301333
Raw.Material=ABS 0.00000 0.00000 10.99476 0.0254288561 -2.234825
Impermeability=Type 1 12.50000 72.41379 87.95812 0.0130149176 -2.483361
$category$`Supplier B`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=ABS 100.00000 14.18919 10.99476 0.003330616 2.935453
Raw.Material=PS 57.69231 10.13514 13.61257 0.015928453 -2.410551
Shape=Shape 2 60.00000 18.24324 23.56021 0.002374481 -3.038894
$category$`Supplier C`
Cla/Mod Mod/Cla Global p.value v.test
Raw.Material=PP 9.722222 100 75.39267 0.01626019 2.403023
$quanti.var
Eta2 P-value
nb.of.pieces 0.2137072 1.530822e-10
$quanti
$quanti$`Supplier A`
NULL
$quanti$`Supplier B`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -2.817845 3.959459 4.115183 1.240523 1.413225 0.004834708
$quanti$`Supplier C`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 6.345875 6.428571 4.115183 1.720228 1.413225 2.211654e-10
attr(,"class")
[1] "catdes" "list "
catdes(raw_data, num.var=5)
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Impermeability 2.873602e-16 3
Raw.Material 8.762044e-07 6
Finishing 1.040072e-02 3
$category
$category$`Shape 1`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 1 76.785714 99.2307692 87.95812 1.043033e-11 6.800436
Raw.Material=PP 72.222222 80.0000000 75.39267 3.596432e-02 2.097331
Finishing=Lacquering 72.868217 72.3076923 67.53927 4.420603e-02 2.012132
Finishing=Hot Printing 58.064516 27.6923077 32.46073 4.420603e-02 -2.012132
Raw.Material=PS 30.769231 6.1538462 13.61257 3.360691e-05 -4.147538
Impermeability=Type 2 4.347826 0.7692308 12.04188 1.043033e-11 -6.800436
$category$`Shape 2`
Cla/Mod Mod/Cla Global p.value v.test
Impermeability=Type 2 95.65217 48.88889 12.04188 2.151940e-15 7.932260
Raw.Material=PS 69.23077 40.00000 13.61257 1.004609e-07 5.325888
Supplier=Supplier A 41.37931 26.66667 15.18325 2.137281e-02 2.301333
Supplier=Supplier B 18.24324 60.00000 77.48691 2.374481e-03 -3.038894
Raw.Material=PP 16.66667 53.33333 75.39267 2.022645e-04 -3.716171
Impermeability=Type 1 13.69048 51.11111 87.95812 2.151940e-15 -7.932260
$category$`Shape 3`
NULL
$category$`Shape 4`
Cla/Mod Mod/Cla Global p.value v.test
Finishing=Hot Printing 9.677419 75 32.46073 0.0169336 2.388146
Finishing=Lacquering 1.550388 25 67.53927 0.0169336 -2.388146
$quanti.var
Eta2 P-value
Price 0.24285191 2.771217e-11
Diameter 0.23221716 9.994081e-11
Length 0.23112294 1.139178e-10
weight 0.19722569 5.965369e-09
nb.of.pieces 0.10533516 1.120672e-04
Mature.Volume 0.05693699 1.177220e-02
$quanti
$quanti$`Shape 1`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces -3.721118 3.853846 4.115183 1.2716527 1.4132247 1.983430e-04
weight -5.069026 1.418671 1.714121 0.7867181 1.1728539 3.998565e-07
Length -5.469752 8.191771 10.329589 5.0759423 7.8647827 4.506649e-08
Diameter -5.489199 1.026728 1.294639 0.6306647 0.9821218 4.037612e-08
Price -6.344431 14.290942 16.552332 4.8726895 7.1724314 2.232495e-10
$quanti$`Shape 2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Diameter 6.616403 2.143782 1.294639 1.409033 9.821218e-01 3.680436e-11
Length 6.603559 17.116281 10.329589 11.260640 7.864783e+00 4.014035e-11
Price 6.176070 22.340911 16.552332 9.620363 7.172431e+00 6.571699e-10
weight 6.118100 2.651800 1.714121 1.698673 1.172854e+00 9.469770e-10
nb.of.pieces 2.865929 4.644444 4.115183 1.607698 1.413225e+00 4.157871e-03
Mature.Volume -2.005008 58355.222222 82206.026178 68318.473831 9.103190e+04 4.496223e-02
$quanti$`Shape 3`
NULL
$quanti$`Shape 4`
v.test Mean in category Overall mean sd in category Overall sd p.value
nb.of.pieces 3.078997 5.62500 4.115183 9.921567e-01 1.413225 0.002076987
Mature.Volume 2.629122 165250.00000 82206.026178 1.132649e+05 91031.901051 0.008560561
Price 2.044271 21.63988 16.552332 3.822103e+00 7.172431 0.040926790
attr(,"class")
[1] "catdes" "list "
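As a reading aid for the `$category` tables above: Cla/Mod is the percentage of the lots carrying a given modality that fall in the class, Mod/Cla is the percentage of the lots in the class that carry the modality, and Global is the modality's share of all lots. The sketch below reproduces the Finishing=Hot Printing row of the Shape 4 table. The four counts are assumptions inferred from the printed percentages; they are not stated in the output itself.

```r
# Counts inferred from the printed percentages (assumptions, not source values)
n_all      <- 191  # total number of lots used by catdes
n_class    <- 8    # lots with Shape 4
n_modality <- 62   # lots with Finishing = Hot Printing
n_both     <- 6    # lots that are both Shape 4 and Hot Printing

cla_mod <- 100 * n_both / n_modality   # share of Hot Printing lots that are Shape 4
mod_cla <- 100 * n_both / n_class      # share of Shape 4 lots that are Hot Printing
global  <- 100 * n_modality / n_all    # share of Hot Printing among all lots

# Compare with the Cla/Mod, Mod/Cla and Global columns printed above
print(c(cla_mod = cla_mod, mod_cla = mod_cla, global = global))
```

A Mod/Cla value well above Global (here 75 against roughly 32) is what produces the positive v.test for that modality.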
catdes(raw_data, num.var=6)
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
Chi-squared approximation may be incorrect
$test.chi2
p.value df
Raw.Material 4.088669e-21 2
Shape 2.873602e-16 3
Supplier 1.088731e-02 2
$category
$category$`Type 1`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 1 99.23077 76.785714 68.06283 1.043033e-11 6.800436
Raw.Material=PP 97.91667 83.928571 75.39267 1.773212e-11 6.723573
Supplier=Supplier A 72.41379 12.500000 15.18325 1.301492e-02 -2.483361
Raw.Material=PS 30.76923 4.761905 13.61257 5.429478e-15 -7.816541
Shape=Shape 2 51.11111 13.690476 23.56021 2.151940e-15 -7.932260
$category$`Type 2`
Cla/Mod Mod/Cla Global p.value v.test
Shape=Shape 2 48.8888889 95.652174 23.56021 2.151940e-15 7.932260
Raw.Material=PS 69.2307692 78.260870 13.61257 5.429478e-15 7.816541
Supplier=Supplier A 27.5862069 34.782609 15.18325 1.301492e-02 2.483361
Raw.Material=PP 2.0833333 13.043478 75.39267 1.773212e-11 -6.723573
Shape=Shape 1 0.7692308 4.347826 68.06283 1.043033e-11 -6.800436
$quanti.var
Eta2 P-value
Diameter 0.47062626 6.604215e-28
Length 0.46804072 1.049429e-27
weight 0.45675032 7.728264e-27
Price 0.43301606 4.512224e-25
Mature.Volume 0.07171395 1.801495e-04
$quanti
$quanti$`Type 1`
v.test Mean in category Overall mean sd in category Overall sd p.value
Mature.Volume 3.691294 91225.988095 82206.026178 9.338486e+04 9.103190e+04 2.231162e-04
Price -9.070449 14.805996 16.552332 4.819967e+00 7.172431e+00 1.185272e-19
weight -9.315716 1.420835 1.714121 6.707159e-01 1.172854e+00 1.211330e-20
Length -9.430150 8.338742 10.329589 4.357114e+00 7.864783e+00 4.095012e-21
Diameter -9.456161 1.045344 1.294639 5.411724e-01 9.821218e-01 3.194554e-21
$quanti$`Type 2`
v.test Mean in category Overall mean sd in category Overall sd p.value
Diameter 9.456161 3.115573 1.294639 1.449522 9.821218e-01 3.194554e-21
Length 9.430150 24.871426 10.329589 11.600832 7.864783e+00 4.095012e-21
weight 9.315716 3.856391 1.714121 1.708742 1.172854e+00 1.211330e-20
Price 9.070449 29.308174 16.552332 8.516118 7.172431e+00 1.185272e-19
Mature.Volume -3.691294 16321.086957 82206.026178 13496.587327 9.103190e+04 2.231162e-04
attr(,"class")
[1] "catdes" "list "
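The v.test values in the `$quanti` tables can be checked by hand. The sketch below recomputes the Diameter row for Type 2 from the printed means and overall standard deviation, using a standard error that includes a finite population correction. The group size (23) and total (191) are assumptions inferred from the Mod/Cla and Global percentages, and the two-sided normal p-value is likewise an assumption about how the printed p.value is obtained.

```r
# Values copied from the Diameter row of the Type 2 table above
mean_cat <- 3.115573    # Mean in category
mean_all <- 1.294639    # Overall mean
sd_all   <- 0.9821218   # Overall sd

# Sizes inferred from the printed percentages (assumptions)
n_cat <- 23    # lots with Impermeability = Type 2
n_all <- 191   # total number of lots used by catdes

# Standard error of the category mean under sampling without replacement
se <- sqrt(sd_all^2 / n_cat * (n_all - n_cat) / (n_all - 1))

v_test  <- (mean_cat - mean_all) / se
p_value <- 2 * pnorm(-abs(v_test))   # two-sided normal tail (assumed)

print(c(v_test = v_test, p_value = p_value))
```

The recomputed v.test should land close to the printed 9.456161. The sign shows the direction of the difference: Type 2 caps are larger than average in diameter, and the Type 1 row carries the same value with the opposite sign.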
res.hcpc.famd$call$X$Dim.1
[1] -1.98063919 -1.83264787 -1.83106329 -1.78617733 -1.77810618 -1.76793474 -1.76710263 -1.75740608 -1.74971627 -1.74854492 -1.72234966 -1.66587800 -1.64757197 -1.64122934
[15] -1.63598508 -1.56390795 -1.53944334 -1.53334618 -1.48308272 -1.38978669 -1.31431198 -1.31360993 -1.30659672 -1.30106267 -1.24853337 -1.22972350 -1.22095845 -1.21456405
[29] -1.18418458 -1.18021014 -1.17935581 -1.17927296 -1.17545877 -1.16352170 -1.15631274 -1.15536248 -1.14893262 -1.14368481 -1.13904799 -1.11948077 -1.11592461 -1.11166467
[43] -1.10864477 -1.07786759 -1.07465242 -1.07231358 -1.05375226 -1.03073238 -1.01177639 -1.00275909 -1.00244847 -0.99953402 -0.99685597 -0.99557798 -0.99270885 -0.98055457
[57] -0.98041153 -0.97369797 -0.97319443 -0.96309915 -0.94797090 -0.93951350 -0.92381405 -0.92380588 -0.92312060 -0.92280368 -0.92215711 -0.91733190 -0.91127725 -0.90468216
[71] -0.88924490 -0.88886587 -0.88556501 -0.87235273 -0.86644337 -0.86479528 -0.83275001 -0.81147499 -0.77426760 -0.77398705 -0.77256095 -0.76784626 -0.76735493 -0.76500699
[85] -0.75906301 -0.74956767 -0.74385898 -0.72805741 -0.71957456 -0.70612214 -0.67838613 -0.67821432 -0.64991811 -0.64920144 -0.64269013 -0.63822303 -0.63582084 -0.63163680
[99] -0.62753059 -0.62599476 -0.62186560 -0.60480936 -0.58138172 -0.57247016 -0.56818608 -0.55787880 -0.55368923 -0.54139477 -0.53854440 -0.53480620 -0.52898642 -0.52824415
[113] -0.52616669 -0.52564460 -0.52454183 -0.52449868 -0.51962886 -0.51480527 -0.50739799 -0.49352372 -0.48049116 -0.47854325 -0.47511958 -0.45213593 -0.44366084 -0.41417752
[127] -0.40516410 -0.39864924 -0.39340526 -0.38440624 -0.38336093 -0.38001794 -0.37853964 -0.35220981 -0.34431429 -0.32536820 -0.31982441 -0.31803255 -0.30039211 -0.28421095
[141] -0.28417967 -0.27830495 -0.27577096 -0.26247151 -0.25453002 -0.24784546 -0.23849067 -0.23173017 -0.23034051 -0.22769867 -0.15845660 -0.05646856 -0.05377382 0.05889003
[155] 0.07158955 0.07913132 0.16276548 0.79185578 1.06680161 1.09360971 1.14437872 1.21748198 1.25067158 1.37109701 2.03126674 2.20958460 2.25060435 2.48114155
[169] 2.49540111 2.51611986 2.68062272 3.05674906 3.07969012 3.23130366 3.77350671 3.80195508 4.13207845 4.32016658 4.35027614 4.72999750 5.26696337 5.37357116
[183] 5.86802720 5.87862818 6.58628131 6.60222800 6.60582995 7.20144721 7.28664755 7.39353843 7.91577973